Load in packages
library(tidyverse)
library(ggplot2)
import data:
find missing data:
No missing data. Check data types of each variable:
str(house)
'data.frame': 21613 obs. of 22 variables:
$ id : num 7.13e+09 6.41e+09 5.63e+09 2.49e+09 1.95e+09 ...
$ date : chr "20141013T000000" "20141209T000000" "20150225T000000" "20141209T000000" ...
$ price : num 221900 538000 180000 604000 510000 ...
$ bedrooms : int 3 3 2 4 3 4 3 3 3 3 ...
$ bathrooms : num 1 2.25 1 3 2 4.5 2.25 1.5 1 2.5 ...
$ sqft_living : int 1180 2570 770 1960 1680 5420 1715 1060 1780 1890 ...
$ sqft_lot : int 5650 7242 10000 5000 8080 101930 6819 9711 7470 6560 ...
$ floors : num 1 2 1 1 1 1 2 1 1 2 ...
$ waterfront : int 0 0 0 0 0 0 0 0 0 0 ...
$ view : int 0 0 0 0 0 0 0 0 0 0 ...
$ condition : int 3 3 3 5 3 3 3 3 3 3 ...
$ grade : int 7 7 6 7 8 11 7 7 7 7 ...
$ sqft_above : int 1180 2170 770 1050 1680 3890 1715 1060 1050 1890 ...
$ sqft_basement: int 0 400 0 910 0 1530 0 0 730 0 ...
$ yr_built : int 1955 1951 1933 1965 1987 2001 1995 1963 1960 2003 ...
$ yr_renovated : int 0 1991 0 0 0 0 0 0 0 0 ...
$ zipcode : int 98178 98125 98028 98136 98074 98053 98003 98198 98146 98038 ...
$ lat : num 47.5 47.7 47.7 47.5 47.6 ...
$ long : num -122 -122 -122 -122 -122 ...
$ sqft_living15: int 1340 1690 2720 1360 1800 4760 2238 1650 1780 2390 ...
$ sqft_lot15 : int 5650 7639 8062 5000 7503 101930 6819 9711 8113 7570 ...
$ num : int 1 1 1 1 1 1 1 1 1 1 ...
We will definitely need to change the data type for the date column, and potentially look into creating factors for some of the more ordinal variables.
Convert date variable to date type:
Turning view, condition, grade, and waterfront into ordered factors:
Part 2: EDA
library(gridExtra)
names(house)
[1] "id" "date" "price" "bedrooms" "bathrooms" "sqft_living" "sqft_lot"
[8] "floors" "waterfront" "view" "condition" "grade" "sqft_above" "sqft_basement"
[15] "yr_built" "yr_renovated" "zipcode" "lat" "long" "sqft_living15" "sqft_lot15"
[22] "num"
Question: how should the ordinal variables be handled in this case? One option is to map them to binary classes:



Map to binary classes and check distributions and interactions

Checking possible interactions after collapsing the categorical variables into broader classes
## arrange the four scatter plots in a 2-by-2 grid
grid.arrange(sp1, sp2, sp3, sp4, ncol = 2, nrow = 2)

Final check: the same scatter plots with log(price). No visible interaction with log(price).
## arrange the four scatter plots in a 2-by-2 grid
grid.arrange(sp1, sp2, sp3, sp4, ncol = 2, nrow = 2)

Quantitative predictors:

Checking how many quantitative observations have 0 values
colSums(house[,quant_vars] == 0)
yr_built yr_renovated floors bedrooms bathrooms sqft_living sqft_lot sqft_above
0 20699 0 13 10 0 0 0
sqft_basement sqft_living15 sqft_lot15
13126 0 0
Some homes probably have no basement, so zeros in sqft_basement are plausible. Every home should have at least one bedroom and one bathroom, yet 13 rows have zero bedrooms and 10 have zero bathrooms. Drop these rows:
colSums(house[,quant_vars] == 0)
yr_built yr_renovated floors bedrooms bathrooms sqft_living sqft_lot sqft_above
0 20683 0 0 0 0 0 0
sqft_basement sqft_living15 sqft_lot15
13110 0 0
Converting the quantitative predictor floors to a factor with levels 1, 2, 3 (half floors are rounded down):
grid.arrange(sp_floors, bp_floors, ncol = 2, nrow = 1)

Computing the age of the house (years since construction or the last renovation, as of 2021) and removing yr_built and yr_renovated
names(house)
[1] "price" "bedrooms" "bathrooms" "sqft_living" "sqft_lot" "floors" "waterfront"
[8] "view" "condition" "grade" "sqft_above" "sqft_basement" "sqft_living15" "sqft_lot15"
[15] "age"
Final set of quantitative vars:
hist.data.frame(house[,quant_vars])

Correlations of quantitative vars:
ggcorrplot(corr,
method = "circle",
lab = TRUE,
# type = "lower",
outline.color = "white",
ggtheme = ggplot2::theme_gray,
colors = c("#6D9EC1", "white", "#E46726"))

Open questions:
Should floors 1, 2, 3 be treated as a factor?
Why is the coefficient for sqft_basement NA in the full model?
Will removing outliers improve the residuals vs. fitted values plot?
str(house)
'data.frame': 21597 obs. of 15 variables:
$ price : num 221900 538000 180000 604000 510000 ...
$ bedrooms : int 3 3 2 4 3 4 3 3 3 3 ...
$ bathrooms : num 1 2.25 1 3 2 4.5 2.25 1.5 1 2.5 ...
$ sqft_living : int 1180 2570 770 1960 1680 5420 1715 1060 1780 1890 ...
$ sqft_lot : int 5650 7242 10000 5000 8080 101930 6819 9711 7470 6560 ...
$ floors : Factor w/ 3 levels "1","2","3": 1 2 1 1 1 1 2 1 1 2 ...
$ waterfront : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
$ view : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ condition : Factor w/ 2 levels "0","1": 1 1 1 2 1 1 1 1 1 1 ...
$ grade : Factor w/ 2 levels "0","1": 1 1 2 1 2 2 1 1 1 1 ...
$ sqft_above : int 1180 2170 770 1050 1680 3890 1715 1060 1050 1890 ...
$ sqft_basement: int 0 400 0 910 0 1530 0 0 730 0 ...
$ sqft_living15: int 1340 1690 2720 1360 1800 4760 2238 1650 1780 2390 ...
$ sqft_lot15 : int 5650 7639 8062 5000 7503 101930 6819 9711 8113 7570 ...
$ age : num 66 30 88 56 34 20 26 58 61 18 ...
---
title: 'STAT 6021: Project 2'
author: "Connie Cui"
date: "11/26/2021"
output:
  html_document:
    df_print: paged
  html_notebook: default
---


Load in packages
```{r}
library(tidyverse)
library(ggplot2)
```
import data:
```{r}
house <- read.csv("house_data.csv")
head(house)
```
find missing data:
```{r}
# list rows of data that have missing values
house[!complete.cases(house),]
```
No missing data.
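
Listing incomplete rows works well when there are only a few of them. As a complementary check, a per-column count of missing values can be sketched as follows; the `toy` data frame and its values are hypothetical and only stand in for `house`.

```r
# Toy data frame (hypothetical values) with one missing entry per column
toy <- data.frame(price = c(221900, NA, 180000),
                  bedrooms = c(3L, 3L, NA))

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Number of rows with at least one missing value
n_incomplete <- sum(!complete.cases(toy))
print(n_incomplete)
```
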
Check data types of each variable:
```{r}
str(house)
```
We will definitely need to change the data type for the date column, and potentially look into creating factors for some of the more ordinal variables.
```{r}
house$date = substr(house$date,1,nchar(house$date)-7)
head(house)
```
Convert date variable to date type:
```{r}
house$date <- as.Date(house$date, "%Y%m%d")
head(house)
```
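
The two steps above (trim the time suffix, then parse) can also be sketched on a standalone vector. This is only an illustration: `raw_dates` is a hypothetical vector holding two strings in the same layout as the `date` column, and `sub()` is swapped in for the `substr()` call so the trimming does not depend on a fixed suffix length.

```r
# Two timestamps in the same layout as the `date` column
raw_dates <- c("20141013T000000", "20141209T000000")

# Drop everything from the "T" onward, then parse the remaining YYYYMMDD part
parsed <- as.Date(sub("T.*$", "", raw_dates), format = "%Y%m%d")
print(parsed)
print(class(parsed))
```
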
Turning view, condition, grade, and waterfront into ordered factors:
```{r}
house$view <- factor(house$view, ordered = TRUE, levels = c(0, 1, 2, 3, 4))
house$condition <- factor(house$condition, ordered = TRUE, levels = c(1, 2, 3, 4, 5))
house$grade <- factor(house$grade, ordered = TRUE, levels = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15))
house$waterfront <- factor(house$waterfront, ordered = TRUE, levels = c(0, 1))
```
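
Whether a factor is ordered matters once it enters `lm()`: under R's default contrast options, ordered factors are coded with polynomial contrasts and unordered factors with treatment (dummy) contrasts. A small sketch, using a hypothetical vector `cond` coded like `condition`:

```r
# Hypothetical 5-level ordinal variable, coded the same way as `condition`
cond <- factor(c(3, 5, 4, 3, 1, 2), ordered = TRUE, levels = 1:5)

# Ordered factors get polynomial contrasts by default (linear, quadratic, ...)
poly_contrasts <- contrasts(cond)
print(colnames(poly_contrasts))

# An unordered copy gets treatment (dummy) contrasts instead
dummy_contrasts <- contrasts(factor(cond, ordered = FALSE))
print(colnames(dummy_contrasts))
```

With polynomial contrasts, the model summary reports terms such as `.L` and `.Q` rather than one coefficient per level, which is worth keeping in mind when interpreting the fits.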


## Part 2: EDA

```{r}
#install.packages("ggcorrplot")
#install.packages("miscset")
#library(miscset)
#library(Hmisc)
library(tidyverse)
library(dplyr)
library(faraway)
library(gridExtra)
```

```{r}
names(house)
```

#### Before dropping the date and geographic columns, consider using them for plots, for example transaction counts by date.


```{r}
house <- subset(house, select=-c(id,num, date, zipcode, lat, long))
names(house)
```

```{r}
#describe(house)
```


Summary plots:
```{r}
sp1 <- ggplot(house, aes(x=sqft_living, y=price, color=waterfront))+
  geom_point()+
  geom_smooth(method = "lm", se=FALSE)+
  labs(x="sqft_living", 
       y="price",
       title="Scatter plot of price against sqft_living with waterfront indicator")

sp2 <- ggplot(house, aes(x=sqft_living, y=price, color=view))+
  geom_point()+
  geom_smooth(method = "lm", se=FALSE)+
  labs(x="sqft_living", 
       y="price",
       title="Scatter plot of price against sqft_living with view indicator")
  
  
sp3 <- ggplot(house, aes(x=sqft_living, y=price, color=condition))+
  geom_point()+
  geom_smooth(method = "lm", se=FALSE)+
  labs(x="sqft_living", 
       y="price",
       title="Scatter plot of price against sqft_living with condition indicator")
  
  
  
sp4 <- ggplot(house, aes(x=sqft_living, y=price, color=grade))+
  geom_point()+
  geom_smooth(method = "lm", se=FALSE)+
  labs(x="sqft_living", 
       y="price",
       title="Scatter plot of price against sqft_living with grade indicator")

## arrange the four scatter plots in a 2-by-2 grid
grid.arrange(sp1, sp2, sp3, sp4, ncol = 2, nrow = 2)
```


#### Question: how should the ordinal variables be handled in this case? One option is to map them to binary classes:

```{r}
cat_vars = c("waterfront", "view", "condition", "grade")
```

```{r}
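# ggplotGrid() comes from miscset, which is not loaded above; grid.arrange() from gridExtra does the same job
grid.arrange(
  grobs = lapply(c("view", "waterfront", "condition", "grade"),
    function(col) {
        ggplot(house, aes(x = .data[[col]])) + geom_bar() + coord_flip()
    }),
  ncol = 2)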
```

```{r}
bp1 <- ggplot(house, aes(x=waterfront, y=price))+
geom_boxplot()+
labs(x="waterfront", y="price", title="Price by waterfront")

bp2 <- ggplot(house, aes(x=view, y=price))+
geom_boxplot()+
labs(x="view", y="price", title="Price by view")

bp3 <- ggplot(house, aes(x=condition, y=price))+
geom_boxplot()+
labs(x="condition", y="price", title="Price by condition")

bp4 <- ggplot(house, aes(x=grade, y=price))+
geom_boxplot()+
labs(x="grade", y="price", title="Price by grade")

## arrange the four box plots in a 2-by-2 grid
grid.arrange(bp1, bp2, bp3, bp4, ncol = 2, nrow = 2)
```

#### Map to binary classes and check distributions and interactions

```{r}
# Collapse `view`: 0 = no view (original level 0), 1 = any view (levels 1-4)
house$view <- factor(ifelse(house$view!=0, 1, 0))
# Collapse `condition`: 0 = condition 3 or below, 1 = condition 4 or 5
house$condition <- factor(ifelse(house$condition==1 | house$condition==2 | house$condition==3, 0, 1))
# Collapse `grade`: 0 = grade 7 or below, 1 = grade 8 or above
# (the earlier chain of == tests skipped grade 6, which therefore fell into class 1)
house$grade <- factor(ifelse(as.integer(as.character(house$grade)) <= 7, 0, 1))
```
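
Collapsing an ordinal variable with a chain of `==` tests makes it easy to skip a level. A single cutoff on the numeric value is harder to get wrong. The sketch below assumes a cutoff at grade 7 (0 for grades 7 and below, 1 for grades above 7); the `grade` vector and the names `grade_num` and `grade_bin` are hypothetical.

```r
# Hypothetical grades stored as an ordered factor, as in the notebook
grade <- factor(c(5, 6, 7, 8, 11), ordered = TRUE, levels = 1:13)

# Recover the numeric grade from the level labels, then apply one cutoff
grade_num <- as.integer(as.character(grade))
grade_bin <- factor(ifelse(grade_num <= 7, 0, 1))

# Cross-tabulate to confirm every original grade landed in the intended class
print(table(grade_num, grade_bin))
```
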


```{r}
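# ggplotGrid() comes from miscset, which is not loaded above; grid.arrange() from gridExtra does the same job
grid.arrange(
  grobs = lapply(c("view", "waterfront", "condition", "grade"),
    function(col) {
        ggplot(house, aes(x = .data[[col]])) + geom_bar() + coord_flip()
    }),
  ncol = 2)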
```

#### Checking possible interactions after collapsing the categorical variables into broader classes


```{r}
sp1 <- ggplot(house, aes(x=sqft_living, y=price, color=waterfront))+
  geom_point()+
  geom_smooth(method = "lm", se=FALSE)+
  labs(x="sqft_living", 
       y="price",
       title="Scatter plot of price against sqft_living with waterfront indicator")

sp2 <- ggplot(house, aes(x=sqft_living, y=price, color=view))+
  geom_point()+
  geom_smooth(method = "lm", se=FALSE)+
  labs(x="sqft_living", 
       y="price",
       title="Scatter plot of price against sqft_living with view indicator")
  
sp3 <- ggplot(house, aes(x=sqft_living, y=price, color=condition))+
  geom_point()+
  geom_smooth(method = "lm", se=FALSE)+
  labs(x="sqft_living", 
       y="price",
       title="Scatter plot of price against sqft_living with condition indicator")
  
sp4 <- ggplot(house, aes(x=sqft_living, y=price, color=grade))+
  geom_point()+
  geom_smooth(method = "lm", se=FALSE)+
  labs(x="sqft_living", 
       y="price",
       title="Scatter plot of price against sqft_living with grade indicator")

## arrange the four scatter plots in a 2-by-2 grid
grid.arrange(sp1, sp2, sp3, sp4, ncol = 2, nrow = 2)

```

#### Final check: the same scatter plots with log(price). No visible interaction with log(price).

```{r}
sp1 <- ggplot(house, aes(x=sqft_living, y=log(price), color=waterfront))+
  geom_point()+
  geom_smooth(method = "lm", se=FALSE)+
  labs(x="sqft_living", 
       y="log(price)",
       title="Scatter plot of log(price) against sqft_living with waterfront indicator")

sp2 <- ggplot(house, aes(x=sqft_living, y=log(price), color=view))+
  geom_point()+
  geom_smooth(method = "lm", se=FALSE)+
  labs(x="sqft_living", 
       y="log(price)",
       title="Scatter plot of log(price) against sqft_living with view indicator")

sp3 <- ggplot(house, aes(x=sqft_living, y=log(price), color=condition))+
  geom_point()+
  geom_smooth(method = "lm", se=FALSE)+
  labs(x="sqft_living", 
       y="log(price)",
       title="Scatter plot of log(price) against sqft_living with condition indicator")

sp4 <- ggplot(house, aes(x=sqft_living, y=log(price), color=grade))+
  geom_point()+
  geom_smooth(method = "lm", se=FALSE)+
  labs(x="sqft_living", 
       y="log(price)",
       title="Scatter plot of log(price) against sqft_living with grade indicator")

## arrange the four scatter plots in a 2-by-2 grid
grid.arrange(sp1, sp2, sp3, sp4, ncol = 2, nrow = 2)

```
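
Judging interactions by eye from fitted lines can be backed up with a partial F-test that compares an additive model against one with an interaction term. The sketch below runs on simulated data only: the data frame `sim`, its coefficients, and the built-in slope difference are all made up for illustration and say nothing about the house data.

```r
set.seed(1)
n <- 200
sim <- data.frame(
  sqft = runif(n, 500, 5000),
  waterfront = factor(rbinom(n, 1, 0.3))
)

# Simulated response with a deliberately different slope when waterfront == 1
sim$log_price <- 12 + 0.0003 * sim$sqft +
  0.5 * (sim$waterfront == "1") +
  0.0002 * sim$sqft * (sim$waterfront == "1") +
  rnorm(n, sd = 0.2)

# Additive model vs. model with a sqft-by-waterfront interaction
fit_additive <- lm(log_price ~ sqft + waterfront, data = sim)
fit_inter    <- lm(log_price ~ sqft * waterfront, data = sim)

# Partial F-test: does the interaction term improve the fit?
f_test <- anova(fit_additive, fit_inter)
print(f_test)
```

A small p-value in the second row of the table indicates that the slopes differ between groups. The same comparison could be run on `house` for each categorical predictor.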

#### Quantitative predictors:

```{r}

quant_vars = c("yr_built", "yr_renovated",
               "floors", "bedrooms", "bathrooms", 
               "sqft_living", "sqft_lot", "sqft_above", "sqft_basement", 
               "sqft_living15", "sqft_lot15")

library(Hmisc)
hist.data.frame(house[,quant_vars])

```

#### Checking how many quantitative observations have 0 values

```{r}
colSums(house[,quant_vars] == 0)
```


#### Some homes probably have no basement, so zeros in sqft_basement are plausible. Every home should have at least one bedroom and one bathroom, yet 13 rows have zero bedrooms and 10 have zero bathrooms. Drop these rows:


```{r}
house <- filter(house, bathrooms != 0, bedrooms != 0)
colSums(house[,quant_vars] == 0)
```


#### Converting the quantitative predictor floors to a factor with levels 1, 2, 3 (half floors are rounded down):


```{r}
house$floors <- factor(ifelse(house$floors < 2, 1, ifelse(house$floors < 3, 2, ifelse(house$floors>=3, 3, 0))))

sp_floors <- ggplot(house, aes(x=sqft_living, y=price, color=floors))+
  geom_point()+
  geom_smooth(method = "lm", se=FALSE)+
  labs(x="sqft_living", 
       y="price",
       title="Scatter plot of price against sqft_living with floors indicator")

bp_floors <- ggplot(house, aes(x=floors, y=price))+
  geom_boxplot()+
  labs(x="floors", y="price", title="Price by number of floors")

grid.arrange(sp_floors, bp_floors, ncol = 2, nrow = 1)
```
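
The nested `ifelse()` calls bin `floors` into three classes. A more compact way to express the same binning is `cut()` with left-closed intervals. The sketch below uses a hypothetical vector `floors_raw` covering the half-floor values.

```r
# Hypothetical floor counts, including half floors
floors_raw <- c(1, 1.5, 2, 2.5, 3, 3.5)

# Left-closed bins: [-Inf, 2) -> "1", [2, 3) -> "2", [3, Inf) -> "3"
floors_fct <- cut(floors_raw,
                  breaks = c(-Inf, 2, 3, Inf),
                  right = FALSE,
                  labels = c("1", "2", "3"))

print(data.frame(floors_raw, floors_fct))
```
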


#### Computing the age of the house (years since construction or the last renovation, as of 2021) and removing yr_built and yr_renovated

```{r}
house$age = ifelse(2021-house$yr_renovated >= 2021-house$yr_built, 2021-house$yr_built, 2021-house$yr_renovated)
head(house)
```
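
The `ifelse()` expression above picks the smaller of the two elapsed times, which is the same as subtracting the later of the two years from 2021. A sketch of that equivalence, using the first three `yr_built` and `yr_renovated` values shown by `str(house)`; the names `ref_year`, `age_nested`, and `age_pmax` are introduced here for illustration.

```r
yr_built     <- c(1955, 1951, 1933)
yr_renovated <- c(0, 1991, 0)   # 0 means the house was never renovated
ref_year     <- 2021

# Same logic as the notebook's ifelse() expression
age_nested <- ifelse(ref_year - yr_renovated >= ref_year - yr_built,
                     ref_year - yr_built,
                     ref_year - yr_renovated)

# Equivalent form: years since the later of construction and renovation
age_pmax <- ref_year - pmax(yr_built, yr_renovated)

print(age_pmax)
```
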

```{r}
house <- subset(house, select=-c(yr_renovated, yr_built))
names(house)
```


#### Final set of quantitative vars:

```{r}
quant_vars = c("age", "bedrooms", "bathrooms", 
               "sqft_living", "sqft_lot", "sqft_above", "sqft_basement", 
               "sqft_living15", "sqft_lot15")

hist.data.frame(house[,quant_vars])

```
               
#### Correlations of quantitative vars:

```{r}
corr <- round(cor(house[,c("price",quant_vars)]), 1)
library(ggcorrplot)
ggcorrplot(corr, 
           method = "circle", 
           lab = TRUE,
          # type = "lower", 
           outline.color = "white", 
           ggtheme = ggplot2::theme_gray,
           colors = c("#6D9EC1", "white", "#E46726"))
```
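
Besides reading the plot, the strongly correlated pairs can be pulled out of the correlation matrix directly. The sketch below uses simulated columns in a hypothetical data frame `toy`, and the 0.8 cutoff is an arbitrary assumption, not a rule from the analysis.

```r
set.seed(42)
n <- 100
above    <- runif(n, 500, 3000)
basement <- runif(n, 0, 1000)

# Simulated stand-in for a few of the square-footage columns
toy <- data.frame(sqft_above    = above,
                  sqft_basement = basement,
                  sqft_living   = above + basement,
                  sqft_lot      = runif(n, 2000, 20000))

cm <- cor(toy)

# Keep each pair once (upper triangle) and flag |r| above the assumed 0.8 cutoff
idx <- which(abs(cm) > 0.8 & upper.tri(cm), arr.ind = TRUE)
high_pairs <- data.frame(var1 = rownames(cm)[idx[, "row"]],
                         var2 = colnames(cm)[idx[, "col"]],
                         r    = cm[idx])
print(high_pairs)
```
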





## TODO: interpreting models https://cran.r-project.org/web/packages/jtools/vignettes/summ.html

```{r}
fit <- lm(log(price) ~ . , data = house)
summary(fit)
```

```{r}
plot(fit)
```


```{r}
fit_les.sqft_basement <- lm(log(price) ~ . - sqft_basement, data = house)
summary(fit_les.sqft_basement)
```

```{r}
plot(fit_les.sqft_basement)
```

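#### Open questions:

Should floors 1, 2, 3 be treated as a factor?

Why is the coefficient for sqft_basement NA in the full model? In the rows printed by `str(house)`, sqft_living equals sqft_above + sqft_basement, so the three columns may be perfectly collinear.

Will removing outliers improve the residuals vs. fitted values plot?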
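
An NA coefficient in `lm()` output usually means the predictor is an exact linear combination of predictors that precede it in the formula. The sketch below reproduces that situation on simulated data, where the living area is constructed as the sum of the other two columns. The `toy` data frame and its coefficients are hypothetical; whether the same identity holds for every row of `house` would need to be checked separately.

```r
set.seed(7)
n <- 50
above    <- runif(n, 500, 3000)
basement <- runif(n, 0, 1000)

# Living area is built as an exact sum of the other two columns
toy <- data.frame(sqft_living   = above + basement,
                  sqft_above    = above,
                  sqft_basement = basement)
toy$log_price <- 12 + 0.0004 * toy$sqft_living + rnorm(n, sd = 0.1)

fit_toy <- lm(log_price ~ sqft_living + sqft_above + sqft_basement, data = toy)

# The last of the three collinear terms cannot be estimated and is reported as NA
print(coef(fit_toy))

# alias() shows which linear dependency caused the NA
print(alias(fit_toy)$Complete)
```

Dropping one of the three columns, as the second model does with `sqft_basement`, removes the dependency.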

```{r}
str(house)
```








